Temperature-Zero Hardware Divergence Experiment
Llama 3 8B – CPU vs GPU Inference
Reproduction Instructions
--------------------------------

This repository contains the prompt files and analysis scripts used to test
whether temperature=0 greedy decoding produces hardware-invariant outputs.

The experiment compares CPU and GPU inference using identical prompts,
model weights, and decoding parameters.


ENVIRONMENT
-----------

Ollama version:
    0.17.4

Base model:
    llama3:latest
    ID: 365c0bd3c000
    sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
	
Derived model used for all runs:
    llama3_t0

Modelfile contents:

    FROM llama3:latest
    PARAMETER temperature 0

The derived model ensures all runs use temperature=0 greedy decoding.

Prompt input method:
    Prompts were pasted directly into the Ollama interactive session.
    No system prompt was used.
    Session context was cleared using `/clear` before each run.


HARDWARE
--------

Device:
    Beelink SER9

Processor:
    AMD Ryzen AI 9 HX 370 with Radeon 890M
    2.00 GHz

RAM:
    32 GB (27.6 GB usable)

Operating System:
    Windows 11 Pro
    Version 25H2
    OS Build 26200.7840


EXPERIMENT DESIGN
-----------------

Prompts:
    50 questions per run

Runs per condition:
    3

Prompt regimes:
    BaselineNoJSON
    BaselineWithJSON
    HighRigor
    PushbackStrong

Total generations per hardware per regime:
    150 (50 prompts × 3 runs)

Statistical analysis is performed at the prompt level (n = 50),
using the mean word count across the three runs for each prompt.
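
The per-prompt averaging described above can be sketched as follows. This is
a minimal illustration, not the code in cpu_gpu_stats.py: the function names
are hypothetical, and it assumes a "word" is any whitespace-separated token
and that each run has already been parsed into a mapping from prompt number
to answer text.

```python
from statistics import mean


def word_count(text: str) -> int:
    """Count whitespace-separated tokens in one answer (assumed definition)."""
    return len(text.split())


def per_prompt_mean_word_counts(runs: list[dict[int, str]]) -> dict[int, float]:
    """Average each prompt's word count across runs.

    `runs` holds one dict per run, mapping prompt number to answer text.
    Only prompts that appear in every run are kept.
    """
    shared = set(runs[0]).intersection(*runs[1:])
    return {n: mean(word_count(run[n]) for run in runs) for n in sorted(shared)}


# Toy data: two prompts, three runs (the experiment itself uses 50 prompts).
runs = [
    {1: "Paris is the capital.", 2: "Yes."},
    {1: "Paris is the capital.", 2: "Yes, it is."},
    {1: "The capital is Paris.", 2: "Yes."},
]
means = per_prompt_mean_word_counts(runs)
```

Each hardware condition then contributes one value per prompt, so a paired
comparison between CPU and GPU has n = 50 observations rather than 150.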


STEP 1 — INSTALL THE BASE MODEL
-------------------------------

Open a Windows Terminal (non-admin) and run:

    ollama pull llama3


STEP 2 — CREATE THE TEMPERATURE=0 MODEL
---------------------------------------

Create a file named:

    llama3_t0.modelfile

with the following contents:

    FROM llama3:latest
    PARAMETER temperature 0

Register the derived model:

    ollama create llama3_t0 -f llama3_t0.modelfile

Verify it exists:

    ollama list

The output should include an entry similar to:

    NAME               SIZE     CONTEXT
    llama3_t0:latest   ~5.4GB   8192


STEP 3 — CPU INFERENCE RUNS
---------------------------

Open a new terminal window and run:

    ollama run llama3_t0

At the prompt:

    >>>

clear the session context:

    /clear

Paste the contents of the prompt file for the condition being tested, one of:

    BASELINEJSON.txt
    BASELINENoJSON.txt
    HIGH_RIGORJSON.txt
    PUSHBACK_STRONGJSON.txt

Record the full output.

Repeat as needed to obtain three runs per condition.


STEP 4 — GPU INFERENCE RUNS
---------------------------

First stop the Ollama service.

Right-click the Ollama system tray icon and select:

    Quit

Open a new PowerShell terminal and set the Vulkan GPU environment variables
(values set with `$env:` apply only to that terminal session):

    $env:OLLAMA_VULKAN="1"
    $env:GGML_VK_VISIBLE_DEVICES="0"

Start the model:

    ollama run llama3_t0

Verify that the model is running on the GPU (this check can be run from
another terminal):

    ollama ps

Example output:

    NAME            PROCESSOR
    llama3:latest   100% GPU
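
The GPU check can also be scripted. The sketch below is an optional helper,
not part of the original procedure. It assumes that the first line of the
`ollama ps` output is a header and that a fully offloaded model is reported
with the literal text "100% GPU", as in the example above. Both function
names are hypothetical.

```python
import subprocess


def all_models_on_gpu(ps_output: str) -> bool:
    """Return True if at least one model is listed and every row says '100% GPU'."""
    # Skip the header line and ignore blank lines.
    rows = [line for line in ps_output.splitlines()[1:] if line.strip()]
    return bool(rows) and all("100% GPU" in row for row in rows)


def check_live() -> bool:
    """Run `ollama ps` and apply the same check to its output."""
    result = subprocess.run(
        ["ollama", "ps"], capture_output=True, text=True, check=True
    )
    return all_models_on_gpu(result.stdout)


# Parsing the example output shown above.
sample = "NAME            PROCESSOR\nllama3:latest   100% GPU\n"
on_gpu = all_models_on_gpu(sample)
```

Calling check_live() requires a running Ollama instance with a loaded model.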

At the prompt:

    >>>

clear the session context:

    /clear

Paste the same prompt files used in the CPU runs.

Record the outputs.


IMPORTANT PROCEDURAL NOTES
--------------------------

1. The `/clear` command must be issued before every run to remove residual
   session context.

2. Identical prompt files must be used for CPU and GPU runs.

3. Outputs were recorded exactly as produced by Ollama.

4. No post-processing was applied except extraction of numbered items
   (1–50) during analysis.

5. Header metadata (timestamps, system status output) is ignored by the
   analysis script.
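
The extraction described in notes 4 and 5 can be sketched as follows. The
exact log layout is an assumption here: each answer is taken to start on a
line beginning with its number followed by "." or ")", and everything before
item 1 is treated as header text. The function name is hypothetical and the
real logic in cpu_gpu_stats.py may differ.

```python
import re

# A line such as "12. answer text" or "12) answer text" (assumed layout).
ITEM_START = re.compile(r"^\s*(\d{1,2})[.)]\s+(.*)$")


def extract_items(log_text: str, n_items: int = 50) -> dict[int, str]:
    """Collect numbered answers 1..n_items from a raw output log."""
    items: dict[int, list[str]] = {}
    current = 0  # 0 means no item has started yet, so lines are header text
    for line in log_text.splitlines():
        m = ITEM_START.match(line)
        # Accept a numbered line only if it is the next expected item, so
        # numbered sub-lists inside an answer do not start a new item.
        if m and current < n_items and int(m.group(1)) == current + 1:
            current += 1
            items[current] = [m.group(2).strip()]
        elif current:
            items[current].append(line.strip())
    return {n: " ".join(p for p in parts if p) for n, parts in items.items()}


# Toy log with two header lines and three answers.
sample = (
    "some header line\n"
    "another status line\n"
    "1. First answer\n"
    "continues here.\n"
    "2. Second answer\n"
    "3) Third answer\n"
)
items = extract_items(sample)
```

Requiring the numbers to appear in sequence is a design choice of this
sketch. It keeps header lines and nested lists out of the result, at the cost
of stopping early if the model skips a number.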


OUTPUT ANALYSIS
---------------

All output files were saved in a single directory for analysis.

Analysis Command
----------------

The results reported in the paper were generated using:

    python cpu_gpu_stats.py --base_dir . --out_csv results.csv

The script extracts numbered items (1–50) from the raw output logs
and computes:

    • within-hardware determinism
    • CPU vs GPU divergence
    • mean word counts
    • paired t-tests
    • Wilcoxon tests
    • Cohen's dz effect size
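
Some of these quantities can be illustrated with the standard library alone.
The sketch below is not taken from cpu_gpu_stats.py. It assumes that
determinism and divergence are judged by exact string equality per prompt,
and that the paired statistics are computed on per-prompt mean word counts.
All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def exact_match_rate(a: dict[int, str], b: dict[int, str]) -> float:
    """Fraction of shared prompts whose answers are string-identical."""
    shared = sorted(set(a) & set(b))
    return sum(a[n] == b[n] for n in shared) / len(shared)


def paired_t_and_dz(cpu: list[float], gpu: list[float]) -> tuple[float, float]:
    """Paired t statistic and Cohen's dz for per-prompt differences."""
    if len(cpu) != len(gpu):
        raise ValueError("paired samples must have equal length")
    diffs = [g - c for c, g in zip(cpu, gpu)]
    # Cohen's dz: mean difference divided by the SD of the differences.
    # stdev() is zero if all differences are equal, so dz is undefined then.
    dz = mean(diffs) / stdev(diffs)
    # The paired t statistic (len(diffs) - 1 degrees of freedom) is dz * sqrt(n).
    t = dz * sqrt(len(diffs))
    return t, dz


# Toy data: two runs on one device, one run on the other.
cpu_run1 = {1: "alpha beta", 2: "gamma"}
cpu_run2 = {1: "alpha beta", 2: "gamma"}
gpu_run1 = {1: "alpha beta", 2: "gamma delta"}

determinism = exact_match_rate(cpu_run1, cpu_run2)      # within-hardware
divergence = 1 - exact_match_rate(cpu_run1, gpu_run1)   # CPU vs GPU

# Toy per-prompt mean word counts for four prompts.
cpu_means = [10.0, 12.0, 9.0, 11.0]
gpu_means = [12.0, 13.0, 12.0, 12.0]
t_stat, dz = paired_t_and_dz(cpu_means, gpu_means)
```

The sketch stops at the test statistic. If SciPy is available, p-values for
the paired t-test and the Wilcoxon signed-rank test can be obtained from
scipy.stats.ttest_rel and scipy.stats.wilcoxon.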

Output File Naming Convention
-----------------------------

Output files follow the pattern:

    CONDITION_PROCESSOR_T0_RUN.txt

Example:

    BaselineNoJSON_CPU_T0_C1.txt

Where:

    CONDITION   = prompt regime
    PROCESSOR   = CPU or GPU
    T0          = temperature = 0
    RUN         = execution sequence identifier

        C1      = Cold start run (first execution after starting Ollama)
        W1–W3   = Warm runs (subsequent executions in the same terminal session)
        A1–A3   = Additional runs executed in a separate terminal session
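
The naming convention can be parsed mechanically, for example to group files
by condition and processor before analysis. The sketch below is an
illustration with hypothetical names (RunFile, parse_run_filename). It
assumes the run identifier is always one of the letters C, W, or A followed
by a number.

```python
import re
from typing import NamedTuple

# CONDITION_PROCESSOR_T0_RUN.txt, e.g. BaselineNoJSON_CPU_T0_C1.txt
FILENAME = re.compile(
    r"^(?P<condition>.+)_(?P<processor>CPU|GPU)_T0_(?P<kind>[CWA])(?P<index>\d+)\.txt$"
)

RUN_KINDS = {"C": "cold", "W": "warm", "A": "additional"}


class RunFile(NamedTuple):
    condition: str   # prompt regime
    processor: str   # "CPU" or "GPU"
    run_kind: str    # "cold", "warm", or "additional"
    run_index: int   # e.g. 1 for C1, 3 for W3


def parse_run_filename(name: str) -> RunFile:
    """Split an output file name into its components."""
    m = FILENAME.match(name)
    if m is None:
        raise ValueError(f"file name does not follow the convention: {name!r}")
    return RunFile(
        m["condition"], m["processor"], RUN_KINDS[m["kind"]], int(m["index"])
    )


parsed = parse_run_filename("BaselineNoJSON_CPU_T0_C1.txt")
```

File names that do not match the pattern raise a ValueError, so stray files
in the analysis directory are reported instead of silently ignored.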

Replication Checklist
---------------------

To reproduce the experiment:

1. Install Ollama 0.17.4
2. Pull the llama3 model
3. Create llama3_t0 using the provided Modelfile
4. Run CPU and GPU inference as described
5. Record outputs to text files
6. Run cpu_gpu_stats.py to compute results

INTERPRETATION BOUNDARY
-----------------------

This experiment evaluates hardware-dependent differences in LLM output
under temperature=0 greedy decoding.

The results demonstrate that temperature=0 greedy decoding does not
guarantee identical outputs across different inference hardware.

The study does not attempt to determine the underlying numerical or
kernel-level cause of the observed divergence.